American Census Survey data - Chicago Neighborhood;
Latent Profile Analysis: Gaussian Mixture modeling;
Mclust package;
Mplus;
Exploratory Factor Analysis;
library(mclust)
library(tidyverse)
library(nFactors)
Chicago is famous of diverse neighborhood areas. It’s one thing, but how we can measure each neighborhood area based on the socioeconomic status of each area.
The written compose (including step-by-step in coding and explanation) walk us through how to cluster each neighborhood using a modern method: Latent Profile Analysis. However, the method would have some barriers (model selection criteria, homogeneity and separation, normality assumption) in which we need to use another technique to overcome.
Outline:
1- Prepare ACS data
2- What is LPA
3- Barriers of LPA and the Steps to solve it
4- Mapping
5- LPA with covariate(s)
The American Community Survey (ACS) is a unique data product that includes all the estimates and margins of error from geographies that are published for the ACS.
Data contained demographic, social, economic, and housing subject areas. All Detailed Tables for the ACS 1-year or 5-year estimates. Here we deal with the ACS 5-year estimates which are published for all geographic areas, including census tracts, block groups, American Indian areas, core based statistical areas, combined statistical areas, Congressional districts, and state legislative districts.
The summary of the data “Chicago Census Data by Tract_American Census Survey (ACS)”. Data includes Census tract characteristics of 2010-2014 5-year ACS estimates that easy to download from https://data.census.gov/cedsci/.
We will use the following \(11\) variables in the exploratory factor analysis:
Variable name | Labeling |
---|---|
* AgeDependencyRatio: | age dependency ratio |
* LimitedEngProf5andUP: | limited English proficiency |
* LessThanHS: | less than high school in education level |
* Unemployment | |
* PctForeignBorn: | percentage of foreign born |
* FemaleHHPct: | percentage of female per household |
* MedianIncomeHH: | median income per household |
* AllOcc2009Pct: | percentage of all occupants in 2009 |
* PctVacHousing: | percentage of vacant housing |
* PctBelowPoverty: | percentage below poverty |
* PublicAssistancePct: | percentage of public assistance |
<- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>%
census select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment,
PctForeignBorn, FemaleHHPct, MedianIncomeHH, AllOcc2009Pct, PctVacHousing, PctBelowPoverty, PublicAssistancePct)
# recode into % to scale to other variables
<- census %>% mutate(
df PctForeignBorn = PctForeignBorn*100,
FemaleHHPct = FemaleHHPct*100,
# MedianIncomeHH1Kinv = (max(MedianIncomeHH, na.rm=TRUE) - MedianIncomeHH)/1000,
# we would not use 1Kinv because of weird distribution (multi-mode)
MedianIncomeHH1K = MedianIncomeHH/1000,
AllOcc2009Pct = AllOcc2009Pct*100,
PctVacHousing = PctVacHousing*100,
PublicAssistancePct = PublicAssistancePct*100
%>%
) select(tract, AgeDependencyRatio, LimitedEngProf5andUP, LessThanHS, Unemployment,
PctForeignBorn, FemaleHHPct, MedianIncomeHH1K, AllOcc2009Pct, PctVacHousing,
PctBelowPoverty, PublicAssistancePct)<- df.fa <- df[complete.cases(df),] df2
summary(df2)
tract AgeDependencyRatio LimitedEngProf5andUP
Min. : 10100 Min. : 4.8 Min. : 0.00
1st Qu.:160800 1st Qu.: 39.8 1st Qu.: 1.90
Median :351100 Median : 55.0 Median : 8.40
Mean :403974 Mean : 52.9 Mean :13.99
3rd Qu.:670300 3rd Qu.: 65.3 3rd Qu.:24.00
Max. :843900 Max. :147.4 Max. :64.00
LessThanHS Unemployment PctForeignBorn FemaleHHPct
Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
1st Qu.: 8.20 1st Qu.: 7.00 1st Qu.: 3.858 1st Qu.: 8.704
Median :16.30 Median :11.90 Median :15.115 Median :16.502
Mean :18.39 Mean :14.52 Mean :18.659 Mean :20.122
3rd Qu.:26.20 3rd Qu.:19.80 3rd Qu.:31.076 3rd Qu.:29.859
Max. :62.50 Max. :91.90 Max. :71.263 Max. :70.470
MedianIncomeHH1K AllOcc2009Pct PctVacHousing PctBelowPoverty
Min. : 5.00 Min. : 0.00 Min. : 0.000 Min. : 0.40
1st Qu.: 29.40 1st Qu.:53.45 1st Qu.: 7.752 1st Qu.:12.50
Median : 42.32 Median :62.85 Median :11.722 Median :21.80
Mean : 49.66 Mean :62.85 Mean :13.898 Mean :24.07
3rd Qu.: 62.06 3rd Qu.:72.80 3rd Qu.:18.066 3rd Qu.:33.80
Max. :160.46 Max. :94.06 Max. :50.673 Max. :74.20
PublicAssistancePct
Min. : 0.000
1st Qu.: 9.252
Median :22.002
Mean :24.440
3rd Qu.:36.667
Max. :79.851
<- round(Hmisc::rcorr(as.matrix(df2[,-1]))[[1]],2)
cortab ::paged_table(as.data.frame(cortab), options = list(rows.print = 11, cols.print = 11)) rmarkdown
::corrplot(cor(df2[,-1])) corrplot
As we can see, there are some factors having much association (Unemployment, FemaleHHPct, PctBelowPoverty, PublicAssistancePct), meanwhile some is not.
Individuals can be divided into subgroups based on unobservable construct
Latent profile analysis: static, categorical latent variable measured with continuous items
True class membership is unknown
Latent classes are mutually exclusive & exhaustive
Class-specific means in LPA are analogous to factor loadings
Mathematical model
\[f(x_i) = \sum_{j=1}^K \eta_j \prod_{i=1}^p \frac{1}{\sqrt(2\pi\sigma_{ij}^2)}\text{exp}\Big[\frac{-(x_i-\mu_{ij})^2}{\sigma_{ij}^2}\Big]\]
<- as.matrix(df2[,-1])
X <- Mclust(X) mod
We could use both BIC and ICL. But let’s see the BIC firstly
summary(mod$BIC)
Best BIC values:
VVE,9 EVE,8 EVE,7
BIC -60838.39 -60858.68524 -60929.97097
BIC diff 0.00 -20.29489 -91.58063
plot(mod, what = "BIC", ylim = range(mod$BIC[,-(1:2)], na.rm = TRUE),
legendArgs = list(x = "bottomleft"))
It’s the same if I also used integrated complete-data likelihood (ICL) criterion:
<- mclustICL(X)
ICL plot(ICL)
We run Mclust for all 11 indicators, the data-driven would produce up to 8, 9 clusters.
# tabulate class-membership numbers
table(summary(mod)$classification)
1 2 3 4 5 6 7 8 9
68 137 89 38 52 98 55 79 181
<- MclustDR(mod, lambda = 1)
drmod plot(drmod, what = "contour")
Why happening?
# display the means per class
::paged_table(as.data.frame(round(mod$parameters$mean,2)),
rmarkdownoptions = list(rows.print = 11, cols.print = 9))
par(mar = c(4, 4, .1, .1))
plot(density(df2$Unemployment))
plot(density(df2$PctBelowPoverty))
Thus, with the homogeneity and separation concepts:
Homogeneity analogous to concept of saturation in factor analysis
Latent class separation analogous to concept of simple structure in factor analysis
We can have
\(\Rightarrow\) We want to increase the homogeneity. For this purpose, we would use factor analysis to select covariates in the first components.
<- eigen(cor(df.fa[,-1])) # get eigenvalues
ev <- parallel(subject=nrow(df.fa[,-1]), var=ncol(df.fa[,-1]), rep=100,cent=.05)
ap <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
nS plotnScree(nS)
When we run a factor analysis, we need to decide on three things:
\(\Rightarrow\) Maximum Likelihood
Factor Analysis (bullet 2) entering raw data and extracting 3 factors
(bullet 1), with varimax
rotation (bullet 3)
<- factanal(x = df.fa[,-1], factors = 3, n.obs = 797, rotation="varimax")
fit.f print(fit.f, digits=2, cutoff=.3, sort=TRUE)
Call:
factanal(x = df.fa[, -1], factors = 3, n.obs = 797, rotation = "varimax")
Uniquenesses:
AgeDependencyRatio LimitedEngProf5andUP LessThanHS
0.37 0.00 0.19
Unemployment PctForeignBorn FemaleHHPct
0.29 0.10 0.18
MedianIncomeHH1K AllOcc2009Pct PctVacHousing
0.29 0.21 0.60
PctBelowPoverty PublicAssistancePct
0.12 0.07
Loadings:
Factor1 Factor2 Factor3
AgeDependencyRatio 0.56 0.56
Unemployment 0.80
FemaleHHPct 0.84
MedianIncomeHH1K -0.80
PctVacHousing 0.61
PctBelowPoverty 0.93
PublicAssistancePct 0.96
LimitedEngProf5andUP 0.99
LessThanHS 0.53 0.69
PctForeignBorn 0.91
AllOcc2009Pct 0.89
Factor1 Factor2 Factor3
SS loadings 4.82 2.43 1.33
Proportion Var 0.44 0.22 0.12
Cumulative Var 0.44 0.66 0.78
Test of the hypothesis that 3 factors are sufficient.
The chi square statistic is 222.31 on 25 degrees of freedom.
The p-value is 1.46e-33
In general, we’d like to see low uniquenesses or high communalities (subtract the uniquenesses from 1—The communality is the proportion of variance of the \(i^{th}\) variable contributed by the \(m\) (here’s 3) common factors)
At the conclusion of a factor analysis of census data, we might determine that the census measures 3 factors: (1) “mixed” social poverty disparity strength, (2) education strength, (3) age-housing disparity strength
Factor analysis seeks to model the correlation matrix with fewer variables called factors. If we succeed with, said here, 3 factors, we are able to model the correlation matrix using only 3 variables instead of 11. Just remember these 3 variables, or factors, are unobserved. We give them names like “latent variables”. They are not subsets of our original variables.
Given the data matrix \(X\), consider it’s covariance matrix \(cov(X)=\sum\) and obtain the matrix \(V\) of eigenvectors from \(\sum\). This matrix \(V\) is what you call the loadings.
This set of eigenvectors define an orthogonal change of basis matrix that maximize the variance from X. This means that, if I project X into the subspace generated by V I will obtain a matrix U by simply solving XV=U. This matrix U is what you call the scores (for each factor, we would have its own score).
In our case, we would manipulate a little bit due to we use only variables in the first factor. Thus, if the absolute value of loadings less than or equal 0.5, it would not contribute to the score
We calculate and show FA scores – The concentrated disadvantage index scores:
$concdisadv <- as.matrix(df.fa[,-1]) %*% (fit.f$loadings[,1]*(abs(fit.f$loadings[,1])>0.5))
df.fa# print out the scores
::paged_table(cbind(tract = df.fa[1:20,1],
rmarkdowncondisadv_fascore = df.fa[1:20,13],
1:20,2:12]),
df.fa[options = list(rows.print = 20, cols.print = 13))
The latent profile analysis (normal mixture) will use variables loaded on factor 1 identified from the factor analysis:
1- AgeDependencyRatio
2- Unemployment
3- FemaleHHPct
4- MedianIncomeHH1K
5- PctVacHousing
6- PctBelowPoverty
7- PublicAssistancePct
8- LessThanHS
# recode into % to scale to other variables
<- df2 %>%
df3 select(tract, AgeDependencyRatio, Unemployment, FemaleHHPct,
MedianIncomeHH1K, PctVacHousing, PctBelowPoverty,
PublicAssistancePct, LessThanHS )
<- as.matrix(df3[,-1])
X <- mclustICL(X)
ICL summary(ICL)
Best ICL values:
VVV,3 VVV,4 VVE,7
ICL -45689.17 -45706.83796 -45709.47096
ICL diff 0.00 -17.66358 -20.29659
plot(ICL)
In the above Mclust() function call, the number of mixing components and the covariance parameterization are selected using the integrated complete-data likelihood (ICL) criterion. A summary showing the top-three models and a plot of the ICL traces (the first fig.) for all the models considered is then obtained.
Note: with BIC criteria
<- Mclust(X)
modBIC summary(modBIC$BIC)
Best BIC values:
VVE,7 VVE,6 VVV,4
BIC -45583.8 -45597.13834 -45603.11491
BIC diff 0.0 -13.33485 -19.31142
plot(modBIC, what="BIC", ylim=range(modBIC$BIC[,-(1:2)], na.rm=TRUE),
legendArgs = list(x = "bottomright"))
We got at least 4 to 7 classes with the BIC criteria.
Let see the contour plot of estimated mixture densities.
<- MclustDR(mod, lambda = 1)
drmod plot(drmod, what = "contour")
.3 <- Mclust(X, G = 3)
mod.3 <- MclustDR(mod.3, lambda = 1)
drmodplot(drmod.3, what = "contour")
.3 <- Mclust(df2[,-1], G=3, modelNames = "VVV")
modsummary(mod.3)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VVV (ellipsoidal, varying volume, shape, and orientation)
model with 3 components:
log-likelihood n df BIC ICL
-29837.77 797 233 -61232.17 -61266.61
Clustering table:
1 2 3
189 329 279
Mixture probabilities and mean (sd in separate table below) for Census tract characteristics
.3 <- summary(mod.3, parameters = TRUE)
sum.modrbind(Component = c("Class 1","Class 2","Class 3"),
`Mix Probability` = paste0(round((sum.mod.3$pro)*100,1),"%"),
round(sum.mod.3$mean,1),
Labels = c("Affluent","Poor","Distressed")
)
[,1] [,2] [,3]
Component "Class 1" "Class 2" "Class 3"
Mix Probability "23.7%" "41.3%" "35%"
AgeDependencyRatio "35.9" "50.3" "67.5"
LimitedEngProf5andUP "6.6" "27.3" "3.3"
LessThanHS "5.3" "24.6" "19.9"
Unemployment "5.3" "11.3" "24.6"
PctForeignBorn "14" "33.4" "4.4"
FemaleHHPct "5.8" "15.5" "35.3"
MedianIncomeHH1K "87.1" "44.9" "29.9"
AllOcc2009Pct "56" "64.1" "66"
PctVacHousing "9" "11" "20.6"
PctBelowPoverty "9.3" "22.5" "35.9"
PublicAssistancePct "5" "21" "41.6"
Labels "Affluent" "Poor" "Distressed"
<- round(sqrt(diag(sum.mod.3$variance[,,1])),1)
sd.class1 <- round(sqrt(diag(sum.mod.3$variance[,,2])),1)
sd.class2 <- round(sqrt(diag(sum.mod.3$variance[,,3])),1)
sd.class3 cbind(sd.class1,sd.class2,sd.class3)
sd.class1 sd.class2 sd.class3
AgeDependencyRatio 18.6 14.1 15.9
LimitedEngProf5andUP 5.1 12.6 5.2
LessThanHS 4.4 14.1 8.3
Unemployment 2.6 4.4 9.8
PctForeignBorn 7.2 11.9 5.8
FemaleHHPct 4.0 6.6 11.3
MedianIncomeHH1K 22.9 12.5 11.0
AllOcc2009Pct 16.0 11.3 12.5
PctVacHousing 6.5 4.9 9.7
PctBelowPoverty 5.1 9.6 14.0
PublicAssistancePct 4.1 9.9 14.0
$ConcDisadv_cluster <- mod.3$classification #13th in order
df2with(df2, table(ConcDisadv_cluster))
ConcDisadv_cluster
1 2 3
189 329 279
14:16] <- mod.3$z
df2[,names(df2)[14:16] <- c("ConcDisadv_prob1",
"ConcDisadv_prob2",
"ConcDisadv_prob3")
Concentrated Disadvantage Clustering Selection based on the highest probability
::paged_table(cbind(tract = df2[1:20,1],
rmarkdown`Concentrated Disadvantage` = df2[1:20,13],
1:20,14:16],
df2[1:20,2:12]),
df2[options = list(rows.print = 20, cols.print = 13))
With the level:
We can map on Chicago Neighborhood map (using ArcGIS) as such
<- read.csv("data/Chicago Census Data by Tract_ACS 2015.csv", header = TRUE) %>%
r_eth.df select(tract, PctBlack,PctAsian,PctHispanicLatino,PctWhiteNonHisp)
<- merge(df2, r_eth.df, by="tract") dff
mclust1Dplot(dff[,"PctBlack"], parameters = mod.3$parameters, z = mod.3$z,
what = "classification", main = FALSE)
title("Percent Black")
mclust1Dplot(dff[,"PctAsian"], parameters = mod.3$parameters, z = mod.3$z,
what = "classification", main = FALSE)
title("Percent Asian")
mclust1Dplot(dff[,"PctHispanicLatino"], parameters = mod.3$parameters, z = mod.3$z,
what = "classification", main = FALSE)
title("Percent Hispanic Latino")
mclust1Dplot(dff[,"PctWhiteNonHisp"], parameters = mod.3$parameters, z = mod.3$z,
what = "classification", main = FALSE)
title("Percent White")
We can see that:
Thus, neighborhoods with high percent Black focus on class 3 (the first graph). It’s quite contradict with high percent White (i.e. neighborhoods with high percent White focus on class 1, the last graph).
Now, how can we turn the visualization into an analysis.
I will discuss this in a future post.
Pugach, Oksana. 2018. “Latent Profile Analysis of Chicago Neighborhoods” Unpublished Work. Chalk Talk - Institute for Health Resereach; Policy - Methodology Research Core.
University of Virginia Library, Research Data Services + Sciences. “Getting Started with Factor Analysis” - https://data.library.virginia.edu/getting-started-with-factor-analysis/
Stats Stackexchange. “How to calculate the loading matrix from the score matrix and a data matrix X (PCA)?” - https://stats.stackexchange.com/questions/447952/how-to-calculate-the-loading-matrix-from-the-score-matrix-and-a-data-matrix-x-p
Luca Scrucca. 2020. “A quick tour of mclust” - https://cran.r-project.org/web/packages/mclust/vignettes/mclust.html
Luca Scrucca, Michael Fop, T. Brendan Murphy,Adrian E. Raftery. “mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models.” R J. 2016 August ; 8(1): 289-317 Footer
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hai-mn/hai-mn.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Nguyen (2021, Oct. 24). HaiBiostat: What could we do if Latent Profile Analysis (LPA) run not as expected? -- LPA of Chicago Neighborhoods. Retrieved from https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/
BibTeX citation
@misc{nguyen2021what, author = {Nguyen, Hai}, title = {HaiBiostat: What could we do if Latent Profile Analysis (LPA) run not as expected? -- LPA of Chicago Neighborhoods}, url = {https://hai-mn.github.io/posts/2021-10-24-LPA Chicago Neighborhoods/}, year = {2021} }